Instacart Market Basket Analysis: Leveraging Xplainable Classifier for Transparent Recommendations
Introduction
Welcome to our comprehensive walkthrough of the
Instacart Market Basket Analysis Challenge
on Kaggle. This challenge presents an opportunity for data enthusiasts to dive deep into
the world of grocery shopping and unravel the patterns behind consumer purchase
behavior. Instacart, a prominent online grocery delivery platform, has provided a rich
dataset of customer orders over time. The objective? To
predict which previously purchased products will be in a user's next order.
As data scientists and technical experts, we understand the criticality of not just
making accurate predictions but also being able to interpret and explain our models.
This is where our approach comes into play. We're introducing the Xplainable
algorithm - a novel machine learning algorithm designed to enhance the transparency and
interpretability of its recommendations.
Enhancing Brand Trust through Two-Way Transparency
An integral aspect of our approach with the Xplainable algorithm is fostering
two-way transparency between the recommender system and the customer. In traditional
recommender systems, users are often left wondering why certain products are recommended
to them. Our method provides the opportunity to bridge this gap by incorporating an
explanatory feature, such as "users like yourself also purchase."
The Importance of Relatable Recommendations
By providing context such as "users like yourself also purchase," we achieve several key
objectives:
- Enhanced User Engagement: When users understand the rationale behind
recommendations, they are more likely to explore and accept these suggestions.
- Increased Personalisation: This approach reflects a deeper understanding of user
behaviour and preferences, leading to more personalised shopping experiences.
- Greater Brand Trust and Loyalty: Transparency in recommendations fosters trust.
When users feel that their needs are understood and catered to, it enhances their
loyalty to the brand.
- Feedback Loop for Continuous Improvement: Such transparent systems encourage user
feedback, providing valuable insights that can be used to further refine and improve
the recommender system.
By implementing a two-way transparent model, we not only elevate the accuracy of our
predictions but also enrich the user experience, instilling a sense of trust and
reliability in the brand. In this walkthrough, we will explore how the Xplainable
algorithm not only achieves high accuracy in predicting the next set of products for
Instacart users but also provides clear insights into the 'why' behind its predictions.
import xplainable as xp
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
color = sns.color_palette()
import warnings
warnings.filterwarnings('ignore')
import gc
gc.enable()
print(f"This notebook was created using Xplainable version {xp.__version__}")
Out: This notebook was created using Xplainable version 1.1.1
Load the datasets
The Instacart Market Basket Analysis dataset can be downloaded at the following link:
https://www.kaggle.com/competitions/instacart-market-basket-analysis/data
After extracting the .zip file, load the datasets as below:
orders = pd.read_csv('./dataset/orders.csv')
order_products_train = pd.read_csv('./dataset/order_products__train.csv')
order_products_prior = pd.read_csv('./dataset/order_products__prior.csv')
products = pd.read_csv('./dataset/products.csv')
aisles = pd.read_csv('./dataset/aisles.csv')
departments = pd.read_csv('./dataset/departments.csv')
Inspecting orders dataset
orders.head()
| order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
---|
0 | 2539329 | 1 | prior | 1 | 2 | 8 | nan |
1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15 |
2 | 473747 | 1 | prior | 3 | 3 | 12 | 21 |
3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29 |
4 | 431534 | 1 | prior | 5 | 4 | 15 | 28 |
orders.info()
Out: <class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 7 columns):
# Column Dtype
--- ------ -----
0 order_id int64
1 user_id int64
2 eval_set object
3 order_number int64
4 order_dow int64
5 order_hour_of_day int64
6 days_since_prior_order float64
dtypes: float64(1), int64(5), object(1)
memory usage: 182.7+ MB
orders.isnull().sum()
Out: order_id 0
user_id 0
eval_set 0
order_number 0
order_dow 0
order_hour_of_day 0
days_since_prior_order 206209
dtype: int64
There are 206,209 missing values in the days_since_prior_order column - exactly one per user, because the value is undefined for each user's first order.
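A quick check (a sketch) confirms that the missing values line up with first orders:
# Every NaN in days_since_prior_order should sit on an order_number of 1
first_order_rows = orders.loc[orders['days_since_prior_order'].isnull(), 'order_number']
print((first_order_rows == 1).all())  # expected: True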
Inspecting order_products_train dataset
order_products_train.head()
| order_id | product_id | add_to_cart_order | reordered |
---|
0 | 1 | 49302 | 1 | 1 |
1 | 1 | 11109 | 2 | 1 |
2 | 1 | 10246 | 3 | 0 |
3 | 1 | 49683 | 4 | 0 |
4 | 1 | 43633 | 5 | 1 |
order_products_train.shape
order_products_train.isnull().sum()
Out: order_id 0
product_id 0
add_to_cart_order 0
reordered 0
dtype: int64
Inspecting order_products_prior dataset
order_products_prior.head()
| order_id | product_id | add_to_cart_order | reordered |
---|
0 | 2 | 33120 | 1 | 1 |
1 | 2 | 28985 | 2 | 1 |
2 | 2 | 9327 | 3 | 0 |
3 | 2 | 45918 | 4 | 1 |
4 | 2 | 30035 | 5 | 0 |
order_products_prior.shape
order_products_prior.isnull().sum()
Out: order_id 0
product_id 0
add_to_cart_order 0
reordered 0
dtype: int64
Inspecting products dataset
products.head()
| product_id | product_name | aisle_id | department_id |
---|
0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
1 | 2 | All-Seasons Salt | 104 | 13 |
2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 |
4 | 5 | Green Chile Anytime Sauce | 5 | 13 |
products.isnull().sum()
Out: product_id 0
product_name 0
aisle_id 0
department_id 0
dtype: int64
Inspecting aisles dataset
aisles.head()
| aisle_id | aisle |
---|
0 | 1 | prepared soups salads |
1 | 2 | specialty cheeses |
2 | 3 | energy granola bars |
3 | 4 | instant foods |
4 | 5 | marinades meat preparation |
aisles.isnull().sum()
Out: aisle_id 0
aisle 0
dtype: int64
Inspecting departments dataset
departments.head()
| department_id | department |
---|
0 | 1 | frozen |
1 | 2 | other |
2 | 3 | bakery |
3 | 4 | produce |
4 | 5 | alcohol |
departments.isnull().sum()
Out: department_id 0
department 0
dtype: int64
Exploratory Data Analysis (EDA)
plt.figure(figsize=(6,4))
sns.countplot(x="order_dow", data=orders, color=color[0])
plt.ylabel('Count', fontsize=12)
plt.xlabel('Day of week', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Orders by week day", fontsize=15)
plt.show()
Order volume is highest on days 0 and 1 (generally assumed to be Saturday and Sunday),
presumably because people are at home on weekends and have time to shop.
plt.figure(figsize=(6,4))
sns.countplot(x="order_hour_of_day", data=orders, color=color[0])
plt.ylabel('Count', fontsize=12)
plt.xlabel('Hour of day', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Orders by Hour of day", fontsize=15)
plt.show()
Most orders are placed between 9 AM and 5 PM; far fewer are placed before 7 AM or
after 11 PM.
plt.figure(figsize=(10,6))
sns.countplot(x='days_since_prior_order', data=orders)
plt.xticks(rotation=90)
plt.show()
The tallest bar sits at 30 days, suggesting a large group of users who reorder roughly
monthly (the column is capped at 30, so longer gaps accumulate there). The
second-tallest bar sits at 7 days, indicating a weekly shopping habit.
products_details = pd.merge(left=products,right=departments,how="left")
products_details = pd.merge(left=products_details,right=aisles,how="left")
products_details.head()
| product_id | product_name | aisle_id | department_id | department | aisle |
---|
0 | 1 | Chocolate Sandwich Cookies | 61 | 19 | snacks | cookies cakes |
1 | 2 | All-Seasons Salt | 104 | 13 | pantry | spices seasonings |
2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 | beverages | tea |
3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 | frozen | frozen meals |
4 | 5 | Green Chile Anytime Sauce | 5 | 13 | pantry | marinades meat preparation |
plt.figure(figsize=(10,6))
g=sns.countplot(x="department",data=products_details)
g.set_xticklabels(g.get_xticklabels(), rotation=40, ha="right")
plt.show()
Personal care is the department with the most distinct products, followed by snacks.
plt.figure(figsize=(10,6))
top10_aisle=products_details["aisle"].value_counts()[:10].plot(kind="bar",title='Aisles')
The 'missing' aisle contains the most products, which simply means a large share of products have no aisle assigned.
order_products_name_train = pd.merge(left=order_products_train,right=products.loc[:,["product_id","product_name"]],on="product_id",how="left")
common_Products=order_products_name_train[order_products_name_train.reordered == 1]["product_name"].value_counts().to_frame().reset_index()
plt.figure(figsize=(12,7))
plt.xticks(rotation=90)
sns.barplot(x="product_name", y="index", data=common_Products.head(10))
plt.ylabel('product_name', fontsize=12)
plt.xlabel('count', fontsize=12)
plt.show()
Banana is the most frequently reordered product, followed by Bag of Organic Bananas.
order_products_name_train = pd.merge(left=order_products_name_train,right=products_details.loc[:,["product_id","aisle","department"]],on="product_id",how="left")
common_aisle=order_products_name_train["aisle"].value_counts().to_frame().reset_index()
plt.figure(figsize=(12,7))
plt.xticks(rotation=90)
sns.barplot(x="aisle", y="index", data=common_aisle.head(10),palette="Blues_d")
plt.ylabel('aisle', fontsize=12)
plt.xlabel('count', fontsize=12)
plt.show()
The fresh vegetables aisle accounts for the most purchases, followed by fresh fruits.
common_department=order_products_name_train["department"].value_counts().to_frame().reset_index()
plt.figure(figsize=(12,7))
plt.xticks(rotation=90)
sns.barplot(x="department", y="index", data=common_aisle,palette="Blues_d")
plt.ylabel('department', fontsize=12)
plt.xlabel('count', fontsize=12)
plt.show()
Produce and dairy eggs are the two departments with the highest number of purchases.
train_data_reordered = order_products_train.groupby(["order_id","reordered"])["product_id"].apply(list).reset_index()
train_data_reordered = train_data_reordered[train_data_reordered.reordered == 1].drop(columns=["reordered"]).reset_index(drop=True)
train_data_reordered.head()
| order_id | product_id |
---|
0 | 1 | [49302, 11109, 43633, 22035] |
1 | 36 | [19660, 43086, 46620, 34497, 48679, 46979] |
2 | 38 | [21616] |
3 | 96 | [20574, 40706, 27966, 24489, 39275] |
4 | 98 | [8859, 19731, 43654, 13176, 4357, 37664, 34065... |
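These per-order baskets of reordered products are also raw material for the "users like
yourself also purchase" explanations discussed in the introduction. As a minimal sketch
(not part of the modelling pipeline below), co-occurrence counts over the baskets surface
product pairs that are frequently reordered together:
from itertools import combinations
from collections import Counter

# Count how often each pair of products is reordered within the same order
co_counts = Counter()
for basket in train_data_reordered['product_id']:
    for pair in combinations(sorted(set(basket)), 2):
        co_counts[pair] += 1

print(co_counts.most_common(5))  # most frequently co-reordered product pairs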
Feature Engineering
del products_details
del order_products_name_train
del common_Products
del common_aisle
del common_department
del train_data_reordered
gc.collect()
aisles['aisle'] = aisles['aisle'].astype('category')
departments['department'] = departments['department'].astype('category')
orders['eval_set'] = orders['eval_set'].astype('category')
products['product_name'] = products['product_name'].astype('category')
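The category casts above mainly save memory. A quick before/after comparison (a sketch)
on one column:
# Memory of the categorical column vs. its plain-object equivalent (bytes)
print(orders['eval_set'].astype('object').memory_usage(deep=True))
print(orders['eval_set'].memory_usage(deep=True))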
prior_orders = pd.merge(orders, order_products_prior, on='order_id', how='inner')
prior_orders.head()
| order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered |
---|
0 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 196 | 1 | 0 |
1 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 14084 | 2 | 0 |
2 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 12427 | 3 | 0 |
3 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 26088 | 4 | 0 |
4 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 26405 | 5 | 0 |
Create Features using user_id
users = prior_orders.groupby(by='user_id')['order_number'].aggregate('max').to_frame('num_of_orders_for_each_user').reset_index()
users.head()
| user_id | num_of_orders_for_each_user |
---|
0 | 1 | 10 |
1 | 2 | 14 |
2 | 3 | 12 |
3 | 4 | 5 |
4 | 5 | 4 |
total_products_per_order = prior_orders.groupby(by=['user_id', 'order_id'])['product_id'].aggregate('count').to_frame('total_products_per_order').reset_index()
avg_number_of_products_per_order = total_products_per_order.groupby(by=['user_id'])['total_products_per_order'].mean().to_frame('avg_no_prd_per_order').reset_index()
del [total_products_per_order]
gc.collect()
avg_number_of_products_per_order.head()
| user_id | avg_no_prd_per_order |
---|
0 | 1 | 5.9 |
1 | 2 | 13.9286 |
2 | 3 | 7.33333 |
3 | 4 | 3.6 |
4 | 5 | 9.25 |
from scipy import stats
def calculate_mode(x):
    # Return the most frequent value in x; handles scipy versions where
    # stats.mode returns an array as well as those returning a scalar
    if len(x) > 0:
        mode_result = stats.mode(x)
        if isinstance(mode_result.mode, np.ndarray) and mode_result.mode.size > 0:
            return mode_result.mode[0]
        else:
            return mode_result.mode
    else:
        return pd.NA
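A quick check of the helper on hypothetical values:
# The mode of [2, 3, 3, 4] is 3
print(calculate_mode(pd.Series([2, 3, 3, 4])))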
order_most_dow = prior_orders.groupby(by=['user_id'])['order_dow'].aggregate(calculate_mode).to_frame('dow_with_most_orders').reset_index()
order_most_dow.head()
| user_id | dow_with_most_orders |
---|
0 | 1 | 4 |
1 | 2 | 2 |
2 | 3 | 0 |
3 | 4 | 4 |
4 | 5 | 3 |
The most common order hour per user can reuse the same calculate_mode helper; a second
identical function is unnecessary.
order_most_hod = prior_orders.groupby(by=['user_id'])['order_hour_of_day'].aggregate(calculate_mode).to_frame('hod_with_most_orders').reset_index()
order_most_hod.head()
| user_id | hod_with_most_orders |
---|
0 | 1 | 7 |
1 | 2 | 9 |
2 | 3 | 16 |
3 | 4 | 15 |
4 | 5 | 18 |
user_reorder_ratio = prior_orders.groupby(by='user_id')['reordered'].aggregate('mean').to_frame('reorder_ratio').reset_index()
user_reorder_ratio['reorder_ratio'] = user_reorder_ratio['reorder_ratio'].astype(np.float16)
user_reorder_ratio.head()
| user_id | reorder_ratio |
---|
0 | 1 | 0.694824 |
1 | 2 | 0.476807 |
2 | 3 | 0.625 |
3 | 4 | 0.055542 |
4 | 5 | 0.378418 |
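The float16 cast halves memory at the cost of precision, which is why the table shows
values like 0.694824 rather than the exact ratio. A minimal illustration with a
hypothetical ratio:
import numpy as np

exact = 41 / 59            # hypothetical user reorder ratio
print(exact)               # 0.6949152542372881
print(np.float16(exact))   # ~0.6948 - float16 keeps roughly 3 decimal digits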
users = users.merge(avg_number_of_products_per_order, on='user_id', how='left')
users = users.merge(order_most_dow, on='user_id', how='left')
users = users.merge(order_most_hod, on='user_id', how='left')
users = users.merge(user_reorder_ratio, on='user_id', how='left')
users.head()
| user_id | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio |
---|
0 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 |
1 | 2 | 14 | 13.9286 | 13.9286 | 2 | 9 | 0.476807 |
2 | 3 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 |
3 | 4 | 5 | 3.6 | 3.6 | 4 | 15 | 0.055542 |
4 | 5 | 4 | 9.25 | 9.25 | 3 | 18 | 0.378418 |
Note: the duplicated avg_no_prd_per_order_x / avg_no_prd_per_order_y columns in the table
above are likely an artifact of the merge cell having been executed twice; on a fresh run
a single avg_no_prd_per_order column is produced.
del [avg_number_of_products_per_order,order_most_dow,order_most_hod,user_reorder_ratio]
gc.collect()
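The user features above are built one groupby at a time for readability. Two of them
could equally be produced in a single pass with named aggregation - a sketch, not a
replacement for the pipeline above:
# Single-pass construction of two of the user-level features
users_alt = prior_orders.groupby('user_id').agg(
    num_of_orders_for_each_user=('order_number', 'max'),
    reorder_ratio=('reordered', 'mean'),
).reset_index()
print(users_alt.head())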
Create features using product_id
purchased_num_of_times = prior_orders.groupby(by='product_id')['order_id'].aggregate('count').to_frame('purchased_num_of_times').reset_index()
purchased_num_of_times.head()
| product_id | purchased_num_of_times |
---|
0 | 1 | 1852 |
1 | 2 | 90 |
2 | 3 | 277 |
3 | 4 | 329 |
4 | 5 | 15 |
product_reorder_ratio = prior_orders.groupby(by='product_id')['reordered'].aggregate('mean').to_frame('product_reorder_ratio').reset_index()
product_reorder_ratio.head()
| product_id | product_reorder_ratio |
---|
0 | 1 | 0.613391 |
1 | 2 | 0.133333 |
2 | 3 | 0.732852 |
3 | 4 | 0.446809 |
4 | 5 | 0.6 |
add_to_cart = prior_orders.groupby(by='product_id')['add_to_cart_order'].aggregate('mean').to_frame('product_avg_cart_addition').reset_index()
add_to_cart.head()
| product_id | product_avg_cart_addition |
---|
0 | 1 | 5.80184 |
1 | 2 | 9.88889 |
2 | 3 | 6.41516 |
3 | 4 | 9.5076 |
4 | 5 | 6.46667 |
purchased_num_of_times = purchased_num_of_times.merge(product_reorder_ratio, on='product_id', how='left')
purchased_num_of_times = purchased_num_of_times.merge(add_to_cart, on='product_id', how='left')
del [product_reorder_ratio, add_to_cart]
gc.collect()
purchased_num_of_times.head()
| product_id | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition |
---|
0 | 1 | 1852 | 0.613391 | 5.80184 |
1 | 2 | 90 | 0.133333 | 9.88889 |
2 | 3 | 277 | 0.732852 | 6.41516 |
3 | 4 | 329 | 0.446809 | 9.5076 |
4 | 5 | 15 | 0.6 | 6.46667 |
Creating features using user_id and product_id
user_product_data = prior_orders.groupby(by=['user_id', 'product_id'])['order_id'].aggregate('count').to_frame('uxp_times_bought').reset_index()
user_product_data.head()
| user_id | product_id | uxp_times_bought |
---|
0 | 1 | 196 | 10 |
1 | 1 | 10258 | 9 |
2 | 1 | 10326 | 1 |
3 | 1 | 12427 | 10 |
4 | 1 | 13032 | 3 |
product_first_order_num = prior_orders.groupby(by=['user_id', 'product_id'])['order_number'].aggregate('min').to_frame('first_order_number').reset_index()
product_first_order_num.head()
| user_id | product_id | first_order_number |
---|
0 | 1 | 196 | 1 |
1 | 1 | 10258 | 2 |
2 | 1 | 10326 | 5 |
3 | 1 | 12427 | 1 |
4 | 1 | 13032 | 2 |
total_orders = prior_orders.groupby('user_id')['order_number'].max().to_frame('total_orders').reset_index()
total_orders.head()
| user_id | total_orders |
---|
0 | 1 | 10 |
1 | 2 | 14 |
2 | 3 | 12 |
3 | 4 | 5 |
4 | 5 | 4 |
user_product_df = pd.merge(total_orders, product_first_order_num, on='user_id', how='right')
user_product_df.head()
| user_id | total_orders | product_id | first_order_number |
---|
0 | 1 | 10 | 196 | 1 |
1 | 1 | 10 | 10258 | 2 |
2 | 1 | 10 | 10326 | 5 |
3 | 1 | 10 | 12427 | 1 |
4 | 1 | 10 | 13032 | 2 |
user_product_df['order_range'] = user_product_df['total_orders'] - user_product_df['first_order_number'] + 1
user_product_df.head()
| user_id | total_orders | product_id | first_order_number | order_range |
---|
0 | 1 | 10 | 196 | 1 | 10 |
1 | 1 | 10 | 10258 | 2 | 9 |
2 | 1 | 10 | 10326 | 5 | 6 |
3 | 1 | 10 | 12427 | 1 | 10 |
4 | 1 | 10 | 13032 | 2 | 9 |
number_of_times = prior_orders.groupby(by=['user_id', 'product_id'])['order_id'].aggregate('count').to_frame('times_bought').reset_index()
number_of_times.head()
| user_id | product_id | times_bought |
---|
0 | 1 | 196 | 10 |
1 | 1 | 10258 | 9 |
2 | 1 | 10326 | 1 |
3 | 1 | 12427 | 10 |
4 | 1 | 13032 | 3 |
uxp_ratio = pd.merge(number_of_times, user_product_df, on=['user_id', 'product_id'], how='left')
uxp_ratio.head()
| user_id | product_id | times_bought | total_orders | first_order_number | order_range |
---|
0 | 1 | 196 | 10 | 10 | 1 | 10 |
1 | 1 | 10258 | 9 | 10 | 2 | 9 |
2 | 1 | 10326 | 1 | 10 | 5 | 6 |
3 | 1 | 12427 | 10 | 10 | 1 | 10 |
4 | 1 | 13032 | 3 | 10 | 2 | 9 |
uxp_ratio['uxp_reorder_ratio'] = uxp_ratio['times_bought'] / uxp_ratio['order_range']
uxp_ratio.head()
| user_id | product_id | times_bought | total_orders | first_order_number | order_range | uxp_reorder_ratio |
---|
0 | 1 | 196 | 10 | 10 | 1 | 10 | 1 |
1 | 1 | 10258 | 9 | 10 | 2 | 9 | 1 |
2 | 1 | 10326 | 1 | 10 | 5 | 6 | 0.166667 |
3 | 1 | 12427 | 10 | 10 | 1 | 10 | 1 |
4 | 1 | 13032 | 3 | 10 | 2 | 9 | 0.333333 |
uxp_ratio.drop(['times_bought', 'total_orders', 'first_order_number', 'order_range'], axis=1, inplace=True)
uxp_ratio.head()
| user_id | product_id | uxp_reorder_ratio |
---|
0 | 1 | 196 | 1 |
1 | 1 | 10258 | 1 |
2 | 1 | 10326 | 0.166667 |
3 | 1 | 12427 | 1 |
4 | 1 | 13032 | 0.333333 |
user_product_data = user_product_data.merge(uxp_ratio, on=['user_id', 'product_id'], how='left')
del [product_first_order_num, number_of_times,user_product_df,total_orders, uxp_ratio]
gc.collect()
| user_id | product_id | uxp_times_bought | uxp_reorder_ratio |
---|
0 | 1 | 196 | 10 | 1 |
1 | 1 | 10258 | 9 | 1 |
2 | 1 | 10326 | 1 | 0.166667 |
3 | 1 | 12427 | 10 | 1 |
4 | 1 | 13032 | 3 | 0.333333 |
prior_orders['order_number_back'] = prior_orders.groupby(by=['user_id'])['order_number'].transform('max') - prior_orders.order_number + 1
prior_orders.head()
| order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered | order_number_back |
---|
0 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 196 | 1 | 0 | 10 |
1 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 14084 | 2 | 0 | 10 |
2 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 12427 | 3 | 0 | 10 |
3 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 26088 | 4 | 0 | 10 |
4 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 26405 | 5 | 0 | 10 |
temp_df = prior_orders.loc[prior_orders.order_number_back <= 3]
temp_df.head()
| order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered | order_number_back |
---|
38 | 3108588 | 1 | prior | 8 | 1 | 14 | 14 | 12427 | 1 | 1 | 3 |
39 | 3108588 | 1 | prior | 8 | 1 | 14 | 14 | 196 | 2 | 1 | 3 |
40 | 3108588 | 1 | prior | 8 | 1 | 14 | 14 | 10258 | 3 | 1 | 3 |
41 | 3108588 | 1 | prior | 8 | 1 | 14 | 14 | 25133 | 4 | 1 | 3 |
42 | 3108588 | 1 | prior | 8 | 1 | 14 | 14 | 46149 | 5 | 0 | 3 |
last_three_order = temp_df.groupby(by=['user_id', 'product_id'])['order_id'].aggregate('count').to_frame('uxp_last_three').reset_index()
last_three_order.head()
| user_id | product_id | uxp_last_three |
---|
0 | 1 | 196 | 3 |
1 | 1 | 10258 | 3 |
2 | 1 | 12427 | 3 |
3 | 1 | 13032 | 1 |
4 | 1 | 25133 | 3 |
last_three_order['uxp_ratio_last_three'] = last_three_order['uxp_last_three'] / 3
last_three_order.head()
| user_id | product_id | uxp_last_three | uxp_ratio_last_three |
---|
0 | 1 | 196 | 3 | 1 |
1 | 1 | 10258 | 3 | 1 |
2 | 1 | 12427 | 3 | 1 |
3 | 1 | 13032 | 1 | 0.333333 |
4 | 1 | 25133 | 3 | 1 |
user_product_data = user_product_data.merge(last_three_order, on=['user_id', 'product_id'], how='left')
del [last_three_order, temp_df]
gc.collect()
user_product_data.head()
| user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three |
---|
0 | 1 | 196 | 10 | 1 | 3 | 1 |
1 | 1 | 10258 | 9 | 1 | 3 | 1 |
2 | 1 | 10326 | 1 | 0.166667 | nan | nan |
3 | 1 | 12427 | 10 | 1 | 3 | 1 |
4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 |
user_product_data.isnull().sum()
Out: user_id 0
product_id 0
uxp_times_bought 0
uxp_reorder_ratio 0
uxp_last_three 8382738
uxp_ratio_last_three 8382738
dtype: int64
# NaN means the product did not appear in the user's last three orders, so 0 is the correct fill
user_product_data.fillna(0, inplace=True)
user_product_data.isnull().sum()
Out: user_id 0
product_id 0
uxp_times_bought 0
uxp_reorder_ratio 0
uxp_last_three 0
uxp_ratio_last_three 0
dtype: int64
Create final dataframe for engineered features
featured_engineered_data = user_product_data.merge(users, on='user_id', how='left')
featured_engineered_data = featured_engineered_data.merge(purchased_num_of_times, on='product_id', how='left')
del [users, user_product_data, purchased_num_of_times]
gc.collect()
featured_engineered_data.head()
| user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition |
---|
0 | 1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 |
1 | 1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 |
2 | 1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 |
3 | 1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 |
4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 |
featured_engineered_data.isnull().sum()
Out: user_id 0
product_id 0
uxp_times_bought 0
uxp_reorder_ratio 0
uxp_last_three 0
uxp_ratio_last_three 0
num_of_orders_for_each_user 0
avg_no_prd_per_order_x 0
avg_no_prd_per_order_y 0
dow_with_most_orders 0
hod_with_most_orders 0
reorder_ratio 0
purchased_num_of_times 0
product_reorder_ratio 0
product_avg_cart_addition 0
dtype: int64
Creating Train and Test datasets
Create training dataset
orders_future = orders[((orders.eval_set=='train') | (orders.eval_set=='test'))]
orders_future = orders_future[['user_id', 'eval_set', 'order_id']]
final_data = featured_engineered_data.merge(orders_future, on='user_id', how='left')
final_data.head()
| user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | eval_set | order_id |
---|
0 | 1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 | train | 1187899 |
1 | 1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 | train | 1187899 |
2 | 1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 | train | 1187899 |
3 | 1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 | train | 1187899 |
4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 | train | 1187899 |
train_data = final_data[final_data.eval_set=='train']
train_data.head()
| user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | eval_set | order_id |
---|
0 | 1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 | train | 1187899 |
1 | 1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 | train | 1187899 |
2 | 1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 | train | 1187899 |
3 | 1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 | train | 1187899 |
4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 | train | 1187899 |
train_data = train_data.merge(order_products_train[['product_id', 'order_id', 'reordered']], on=['product_id', 'order_id'], how='left')
train_data.head()
| user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | eval_set | order_id | reordered |
---|
0 | 1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 | train | 1187899 | 1 |
1 | 1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 | train | 1187899 | 1 |
2 | 1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 | train | 1187899 | nan |
3 | 1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 | train | 1187899 | nan |
4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 | train | 1187899 | 1 |
train_data.isnull().sum()
Out: user_id 0
product_id 0
uxp_times_bought 0
uxp_reorder_ratio 0
uxp_last_three 0
uxp_ratio_last_three 0
num_of_orders_for_each_user 0
avg_no_prd_per_order_x 0
avg_no_prd_per_order_y 0
dow_with_most_orders 0
hod_with_most_orders 0
reorder_ratio 0
purchased_num_of_times 0
product_reorder_ratio 0
product_avg_cart_addition 0
eval_set 0
order_id 0
reordered 7645837
dtype: int64
# NaN means the candidate product was not in the user's train order, i.e. it was not reordered
train_data['reordered'] = train_data['reordered'].fillna(0)
train_data.head()
| user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | eval_set | order_id | reordered |
---|
0 | 1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 | train | 1187899 | 1 |
1 | 1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 | train | 1187899 | 1 |
2 | 1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 | train | 1187899 | 0 |
3 | 1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 | train | 1187899 | 0 |
4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 | train | 1187899 | 1 |
train_data = train_data.set_index(['user_id', 'product_id'])
train_data = train_data.drop(['eval_set', 'order_id'], axis=1)
train_data.head()
| user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | reordered |
---|
1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 | 1 |
1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 | 1 |
1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 | 0 |
1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 | 0 |
1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 | 1 |
Create testing dataset
test_data = final_data[final_data.eval_set=='test']
test_data.head()
| user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | eval_set | order_id |
---|
120 | 3 | 248 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 6371 | 0.400251 | 10.6208 | test | 2774568 |
121 | 3 | 1005 | 1 | 0.333333 | 1 | 0.333333 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 463 | 0.440605 | 9.49892 | test | 2774568 |
122 | 3 | 1819 | 3 | 0.333333 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 2424 | 0.492162 | 9.28754 | test | 2774568 |
123 | 3 | 7503 | 1 | 0.1 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 12474 | 0.553551 | 9.54738 | test | 2774568 |
124 | 3 | 8021 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 27864 | 0.591157 | 8.82285 | test | 2774568 |
test_data = test_data.set_index(['user_id', 'product_id'])
test_data = test_data.drop(['eval_set', 'order_id'], axis=1)
test_data.head()
| user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition |
---|
3 | 248 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 6371 | 0.400251 | 10.6208 |
3 | 1005 | 1 | 0.333333 | 1 | 0.333333 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 463 | 0.440605 | 9.49892 |
3 | 1819 | 3 | 0.333333 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 2424 | 0.492162 | 9.28754 |
3 | 7503 | 1 | 0.1 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 12474 | 0.553551 | 9.54738 |
3 | 8021 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 27864 | 0.591157 | 8.82285 |
del [final_data, orders_future, products, order_products_train]
gc.collect()
Building model using Xplainable Classifier
Build X_train and y_train dataset
X_train, y_train = train_data.drop('reordered', axis=1), train_data.reordered
Optimise hyperparameters on a 1,000,000-row sample to keep the search tractable
from xplainable.core.optimisation.bayesian import XParamOptimiser
opt = XParamOptimiser()
params = opt.optimise(X_train[:1000000], y_train[:1000000])
Out: 100%|████████| 30/30 [00:45<00:00, 1.53s/trial, best loss: -0.8107380819617527]
Use the params from XParamOptimiser to fit the Xplainable classifier
from xplainable.core.models import XClassifier
model = XClassifier(**params)
model.fit(X_train, y_train)
Out: <xplainable.core.ml.classification.XClassifier at 0x2a426f760>
Model Explanations for Item Recommender
Plot the model explanations using the .explain() method:
model.explain()
In the Feature Importances section, we see a list of features with corresponding
importance values. The feature uxp_reorder_ratio
has the highest importance,
indicating that it is the most influential factor in the model's predictions.
On the Contributions side, the uxp_reorder_ratio feature also shows a notable
contribution to the model's output. The green bars represent positive contributions,
while the red bars indicate negative contributions. The specific contribution values are
again not directly visible, but the length and color of the bars suggest that
uxp_reorder_ratio
has a strong positive influence on the model's predictions.
Create model predictions using a threshold cutoff. NOTE: adjust the threshold cutoff to
see the impact on the result.
test_prediction = (model.predict_proba(test_data) >= 0.21).astype(int)
test_prediction[:5]
Out: array([0, 0, 0, 0, 0])
train_prediction = (model.predict_proba(X_train) >= 0.21).astype(int)
train_prediction[:5]
Out: array([1, 1, 0, 1, 0])
from sklearn.metrics import f1_score, classification_report
# sklearn expects (y_true, y_pred); F1 is symmetric under swapping the two,
# but precision/recall in the report are not
print(f'f1 Score: {f1_score(y_train, train_prediction)}')
print(classification_report(y_train, train_prediction))
Out: f1 Score: 0.41808008442097677
precision recall f1-score support
0 0.92 0.94 0.93 7520042
1 0.45 0.39 0.42 954619
accuracy 0.88 8474661
macro avg 0.69 0.66 0.67 8474661
weighted avg 0.87 0.88 0.87 8474661
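The 0.21 cutoff used above is a judgment call. A short sweep (a sketch) shows how the
training-set F1 responds to the threshold and can guide the choice; ideally the sweep
would be run on a held-out validation split rather than the data used to fit the model:
import numpy as np
from sklearn.metrics import f1_score

# Reuse the predicted probabilities and score a range of cutoffs
probs = model.predict_proba(X_train)
for t in np.arange(0.15, 0.35, 0.02):
    preds = (probs >= t).astype(int)
    print(f'threshold={t:.2f}  F1={f1_score(y_train, preds):.4f}')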
test_data['prediction'] = test_prediction
test_data.head()
| user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | prediction |
---|
3 | 248 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 6371 | 0.400251 | 10.6208 | 0 |
3 | 1005 | 1 | 0.333333 | 1 | 0.333333 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 463 | 0.440605 | 9.49892 | 0 |
3 | 1819 | 3 | 0.333333 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 2424 | 0.492162 | 9.28754 | 0 |
3 | 7503 | 1 | 0.1 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 12474 | 0.553551 | 9.54738 | 0 |
3 | 8021 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 27864 | 0.591157 | 8.82285 | 0 |
final_df = test_data.reset_index()
final_df = final_df[['product_id', 'user_id', 'prediction']]
gc.collect()
final_df.head()
| product_id | user_id | prediction |
---|
0 | 248 | 3 | 0 |
1 | 1005 | 3 | 0 |
2 | 1819 | 3 | 0 |
3 | 7503 | 3 | 0 |
4 | 8021 | 3 | 0 |
Creating the Kaggle submission file (optional)
After developing a robust model and ensuring its performance on our validation set, the
next step is to prepare our submission for Kaggle. Although this step is optional, it is
a good practice to understand how to create a submission file that adheres to the
competition's requirements.
To create a submission file, you typically need to:
- Ensure that your model has been trained with the full training set or with an
appropriate cross-validation strategy.
- Generate predictions for the test set provided by Kaggle.
- Format these predictions into a CSV file that matches the competition's submission
format, which usually involves an id column and a column with your predictions.
- Use the to_csv() function from pandas with the appropriate parameters, such as
index=False if the index should not be included in the submission file, to save your
dataframe to a CSV file.
- Upload this CSV file to the Kaggle competition's submission page to see how your
model performs on the unseen test set.
The specific steps for the Kaggle upload follow below.
orders_test = orders.loc[orders.eval_set == 'test', ['user_id', 'order_id']]
orders_test.head()
| user_id | order_id |
---|
38 | 3 | 2774568 |
44 | 4 | 329954 |
53 | 6 | 1528013 |
96 | 11 | 1376945 |
102 | 12 | 1356845 |
final_df = final_df.merge(orders_test, on='user_id', how='left')
final_df.head()
| product_id | user_id | prediction | order_id |
---|
0 | 248 | 3 | 0 | 2774568 |
1 | 1005 | 3 | 0 | 2774568 |
2 | 1819 | 3 | 0 | 2774568 |
3 | 7503 | 3 | 0 | 2774568 |
4 | 8021 | 3 | 0 | 2774568 |
final_df = final_df.drop('user_id', axis=1)
final_df['product_id'] = final_df.product_id.astype(int)
final_df.head()
| product_id | prediction | order_id |
---|
0 | 248 | 0 | 2774568 |
1 | 1005 | 0 | 2774568 |
2 | 1819 | 0 | 2774568 |
3 | 7503 | 0 | 2774568 |
4 | 8021 | 0 | 2774568 |
final_dict = dict()
for row in final_df.itertuples():
    if row.prediction == 1:
        if row.order_id in final_dict:
            final_dict[row.order_id] += ' ' + str(row.product_id)
        else:
            final_dict[row.order_id] = str(row.product_id)

# Orders with no predicted reorders must still appear in the submission, as 'None'
for order in final_df.order_id:
    if order not in final_dict:
        final_dict[order] = 'None'
gc.collect()
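The itertuples loop is easy to follow; for reference, an equivalent vectorised
construction of the same mapping (a sketch):
# Concatenate predicted product_ids per order, then fill orders with no
# positive predictions with the literal string 'None'
predicted = final_df[final_df.prediction == 1]
products_per_order = (
    predicted.groupby('order_id')['product_id']
    .apply(lambda s: ' '.join(s.astype(str)))
)
all_orders = pd.Index(final_df['order_id'].unique(), name='order_id')
submission_alt = (
    products_per_order.reindex(all_orders)
    .fillna('None')
    .rename('products')
    .reset_index()
)
print(submission_alt.head())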
submission_df = pd.DataFrame.from_dict(final_dict, orient='index')
submission_df.reset_index(inplace=True)
submission_df.columns = ['order_id', 'products']
submission_df.head()
| order_id | products |
---|
0 | 2774568 | 17668 18599 21903 39190 43961 47766 |
1 | 1528013 | 21903 38293 |
2 | 1376945 | 8309 13176 14947 20383 27959 33572 35948 44632 |
3 | 1356845 | 5746 7076 8239 10863 11520 13176 14992 |
4 | 2161313 | 196 10441 11266 12427 14715 27839 37710 |
submission_df.to_csv('sub.csv', index=False, header=True)
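Before uploading, a quick sanity check (a sketch): the file should contain exactly one
row per test order - 75,000 in this competition - and no missing products entries.
# One row per test order, every order has a products string
assert submission_df['order_id'].nunique() == len(submission_df)
assert submission_df['products'].notna().all()
print(len(submission_df))  # expected: 75000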